3. Identifying and Addressing Copyright Infringement
One of the best ways to monitor whether your site’s copy is being
duplicated elsewhere is to use CopyScape.com, a site that enables
you to instantly view pages on the Web that are using your content. Do
not worry if the pages of these sites are in the supplemental index or
rank far behind your own pages for any relevant queries—if any large,
authoritative, content-rich domain tried to fight all the copies of its
work on the Web, it would have at least two 40-hour-per-week jobs on its
hands. Luckily, the search engines have come to trust these types of sites to publish high-quality, relevant, worthwhile content, and therefore tend to recognize them as the original source.
If, on the other hand, you have a relatively new site or a site
with few inbound links, and the scrapers are consistently ranking ahead
of you (or someone with a powerful site is stealing your work), you’ve
got some recourse. One option is to file a DMCA infringement request with Google, Yahoo!, and Bing (you should also file the request with the site's hosting company).
The other option is to file a legal suit (or threaten such)
against the website in question. If the site republishing your work has
an owner in your country, this latter course of action is probably the
wisest first step. You may want to start with a more informal communication asking the site to remove the content before you send a letter from the attorneys, as DMCA takedown requests can take months to take effect; but if the site is nonresponsive, there is no reason to delay taking stronger action.
3.1. An actual penalty situation
The scenarios described so far involve duplicate content filters rather than actual penalties, but for all practical purposes the effect is the same: lower rankings for your pages. There are, however, scenarios in which an actual penalty can occur.
For example, sites that aggregate content from across the Web
can be at risk, particularly if little unique content is added from
the site itself. In this type of scenario, you might see the site
actually penalized.
The only fixes for this are to reduce the number of duplicate
pages accessible to the search engine crawler, either by deleting them
or NoIndexing the pages themselves,
or to add a substantial amount of unique content.
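If you take the NoIndex route, a minimal sketch of the tag might look like the following (the page it sits on is hypothetical, an aggregated article kept for users but excluded from the index):

    <!-- Placed in the <head> of the duplicate/aggregated page; compliant
         crawlers will not index the page but may still follow its links -->
    <meta name="robots" content="noindex, follow">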
One example of duplicate content that may get filtered out on a
broad basis is a thin affiliate site. This
nomenclature frequently describes a site that promotes the sale of someone else's products (to earn a commission) yet provides little or no new information. Such a site may have received its product descriptions from the manufacturer and simply republished them along with an affiliate link (so that it earns credit when a click or purchase occurs).
Search engineers have observed user data suggesting that, from a searcher's perspective, these sites add little value to the engines' indexes.
Thus, the search engines attempt to filter out this type of site, or
even ban it from their index. Plenty of sites operate affiliate models
but also provide rich new content, and these sites generally have no
problem. It is when duplication of content and a lack of unique,
value-adding material come together on a domain that the engines may
take action.
4. How to Avoid Duplicate Content on Your Own Site
As we outlined, duplicate content can be created in many ways.
Internal duplication of material requires specific tactics to achieve
the best possible results from an SEO perspective. In many cases, the
duplicate pages are pages that have no value to either users or search
engines. If that is the case, try to eliminate the problem altogether by
fixing the implementation so that all pages are referred to by only one
URL. Also, 301-redirect the old URLs to the surviving URLs to help the
search engines discover what you have done as rapidly as possible, and
preserve any link juice the removed pages may have had.
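What that redirect looks like depends on your web server; as a rough sketch for an Apache server (the URLs are hypothetical), a retired duplicate URL can be permanently redirected in the site's .htaccess file:

    # Hypothetical .htaccess entry: permanently (301) redirect a removed
    # duplicate URL to the surviving version of the page
    Redirect 301 /old-duplicate-page.html http://www.yourdomain.com/surviving-page.html

It is the 301 (permanent) status code that tells the engines the old URL is gone for good, so they can transfer any link juice it had to the surviving URL.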
If that process proves to be impossible, there are other options. Here is a summary of the simplest solutions for dealing with a variety of scenarios (a brief markup sketch follows this list):
Use the canonical tag. This
is the next best solution to eliminating the duplicate pages.
Use robots.txt to block
search engine spiders from crawling the duplicate versions of pages
on your site.
Use the Robots NoIndex meta
tag to tell the search engine to not index the duplicate
pages.
NoFollow all the links to
the duplicate pages to prevent any link juice from going to those
pages. If you do this, it is still recommended that you NoIndex those pages as well.
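To make these options concrete, here is a minimal markup sketch of each; the URLs and directory names are hypothetical, and you would adapt them to your own site:

    <!-- In the <head> of a duplicate page: point the engines at the
         preferred version of the content -->
    <link rel="canonical" href="http://www.yourdomain.com/preferred-page.html">

    <!-- In the <head> of a duplicate page: keep it out of the index -->
    <meta name="robots" content="noindex">

    <!-- In the body of any page that links to the duplicate: pass no
         link juice through the link -->
    <a href="/print/some-page.html" rel="nofollow">Printer-friendly version</a>

    # In robots.txt at the root of the site (an alternative to the
    # on-page tags above): block crawling of a directory of duplicates
    User-agent: *
    Disallow: /print/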
You can sometimes use these tools in conjunction with one another.
For example, you can NoFollow the
links to a page and also NoIndex the
page itself. This makes sense because you are preventing the page from
getting link juice from your links, and if someone else links to your
page from another site (which you can’t control), you are still ensuring
that the page does not get into the index.
However, if you use robots.txt to prevent a page from being
crawled, be aware that using NoIndex
or NoFollow on the page itself does
not make sense, as the spider can’t read the page, so it will never see
the NoIndex or NoFollow tag. With these tools in mind, here
are some specific duplicate content scenarios:
HTTPS pages
If you make use of SSL (encrypted
communications between the browser and the web server, often used for e-commerce purposes), you will have pages on your site that
begin with https: instead of http:. The problem arises when the
links on your https: pages link back to other pages on the site
using relative instead of absolute links, so (for example) the
link to your home page becomes
https://www.yourdomain.com instead of
http://www.yourdomain.com.
If you have this type of issue on your site, you may want to
use the canonical URL tag or 301
redirects to resolve problems with these types of pages. An
alternative solution is to change the links to absolute links
(http://www.yourdomain.com/content.html
instead of “/content.html”), which also makes life more difficult
for content thieves that scrape your site.
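As a rough sketch (yourdomain.com is a placeholder), the https: version of a page can declare the http: version as its canonical copy, and navigation links can be written as absolute URLs so that the https: page does not spawn an https: copy of the rest of the site:

    <!-- In the <head> of https://www.yourdomain.com/content.html -->
    <link rel="canonical" href="http://www.yourdomain.com/content.html">

    <!-- Absolute rather than relative link back to the home page -->
    <a href="http://www.yourdomain.com/">Home</a>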
CMSs that create duplicate content
Sometimes sites have many versions of identical pages because of limitations in the CMS, which addresses the same content with more than one URL. These are often unnecessary
duplications with no end-user value, and the best practice is to
figure out how to eliminate the duplicate pages and 301 the
eliminated pages to the surviving pages. Failing that, fall back
on the other options listed at the beginning of this
section.
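As one illustration (the URL patterns are hypothetical and will vary by CMS), an Apache rewrite rule can 301 a duplicate URL form produced by the CMS to the surviving, clean URL:

    # Hypothetical .htaccess rules: the CMS serves the same article at
    # /article.php?id=123 and at /article/123, so the query-string form
    # is permanently redirected to the clean, surviving URL
    RewriteEngine On
    RewriteCond %{QUERY_STRING} ^id=([0-9]+)$
    RewriteRule ^article\.php$ /article/%1? [R=301,L]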
Print pages or multiple sort orders
Many sites offer print pages to provide the user with the
same content in a more printer-friendly format. Or some e-commerce
sites offer their products in multiple sort orders (such as size,
color, brand, and price). These pages do have end-user value, but
they do not have value to the search engine and will appear to be
duplicate content. For that reason, use one of the options listed
previously in this subsection.
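For instance (the URLs are hypothetical), a sort-order variant of a category page can point its canonical tag at the default version, and a printer-friendly page can carry the NoIndex tag:

    <!-- In the <head> of /shoes?sort=price, a sort-order variant of /shoes -->
    <link rel="canonical" href="http://www.yourdomain.com/shoes">

    <!-- In the <head> of a printer-friendly version of a page -->
    <meta name="robots" content="noindex, follow">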
Duplicate content in blogs and multiple archiving systems
(pagination, etc.)
Blogs present some interesting duplicate content challenges.
Blog posts can appear on many different pages, such as the home
page of the blog, the Permalink page for the post, date archive
pages, and category pages. Each instance of the post is a duplicate of the other instances. Once again, the solutions
listed earlier in this subsection are the ones to use in
addressing this problem.
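One common pattern (a sketch, not the only option) is to leave the Permalink page of each post as the indexable copy and keep the archive and category copies out of the index, while still letting the engines follow their links through to the posts:

    <!-- In the <head> of date-archive and category pages -->
    <meta name="robots" content="noindex, follow">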
User-generated duplicate content (repostings, etc.)
Many sites implement structures for obtaining user-generated
content, such as a blog, forum, or job board. This can be a great
way to develop large quantities of content at a very low cost. The
challenge is that users may choose to submit the same content on
your site and on several other sites at the same time, resulting
in duplicate content among those sites. It is hard to control
this, but there are two things you can do to reduce the
problem:
Have clear policies that notify users that the content
they submit to your site must be unique and cannot be, or
cannot have been, posted to other sites. This is difficult to enforce, of course, but it still helps somewhat to communicate your expectations.
Implement your forum in a different and unique way that
demands different content. Instead of having only the standard
fields for entering data, include fields that are likely to elicit content that differs from what other sites collect, but that will still be interesting and valuable for site visitors to see.